This is triangulaR v.0.0.1
/\
/ \
/ \
/______\
Usage information is available at: https://github.com/omys-omics/triangulaR/
Please cite the following if you use triangulaR in a publication:
Code
library(vcfR)
***** *** vcfR *** *****
This is vcfR 1.14.0
browseVignettes('vcfR') # Documentation
citation('vcfR') # Citation
***** ***** ***** *****
Registered S3 method overwritten by 'pegas':
method from
print.amova ade4
Attaching package: 'pegas'
The following object is masked from 'package:ape':
mst
The following object is masked from 'package:ade4':
amova
The following objects are masked from 'package:vcfR':
getINFO, write.vcf
Code
#read in datav<-read.vcfR("~/Desktop/grosbeak.rad/grosbeak.filtered.snps.vcf.gz")
Scanning file to determine attributes.
File attributes:
meta lines: 44796
header_line: 44797
variant count: 51595
column count: 147
Meta line 1000 read in.
Meta line 2000 read in.
Meta line 3000 read in.
Meta line 4000 read in.
Meta line 5000 read in.
Meta line 6000 read in.
Meta line 7000 read in.
Meta line 8000 read in.
Meta line 9000 read in.
Meta line 10000 read in.
Meta line 11000 read in.
Meta line 12000 read in.
Meta line 13000 read in.
Meta line 14000 read in.
Meta line 15000 read in.
Meta line 16000 read in.
Meta line 17000 read in.
Meta line 18000 read in.
Meta line 19000 read in.
Meta line 20000 read in.
Meta line 21000 read in.
Meta line 22000 read in.
Meta line 23000 read in.
Meta line 24000 read in.
Meta line 25000 read in.
Meta line 26000 read in.
Meta line 27000 read in.
Meta line 28000 read in.
Meta line 29000 read in.
Meta line 30000 read in.
Meta line 31000 read in.
Meta line 32000 read in.
Meta line 33000 read in.
Meta line 34000 read in.
Meta line 35000 read in.
Meta line 36000 read in.
Meta line 37000 read in.
Meta line 38000 read in.
Meta line 39000 read in.
Meta line 40000 read in.
Meta line 41000 read in.
Meta line 42000 read in.
Meta line 43000 read in.
Meta line 44000 read in.
Meta line 44796 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 51595
Character matrix gt cols: 147
skip: 0
nrows: 51595
row_num: 0
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant 9000
Processed variant 10000
Processed variant 11000
Processed variant 12000
Processed variant 13000
Processed variant 14000
Processed variant 15000
Processed variant 16000
Processed variant 17000
Processed variant 18000
Processed variant 19000
Processed variant 20000
Processed variant 21000
Processed variant 22000
Processed variant 23000
Processed variant 24000
Processed variant 25000
Processed variant 26000
Processed variant 27000
Processed variant 28000
Processed variant 29000
Processed variant 30000
Processed variant 31000
Processed variant 32000
Processed variant 33000
Processed variant 34000
Processed variant 35000
Processed variant 36000
Processed variant 37000
Processed variant 38000
Processed variant 39000
Processed variant 40000
Processed variant 41000
Processed variant 42000
Processed variant 43000
Processed variant 44000
Processed variant 45000
Processed variant 46000
Processed variant 47000
Processed variant 48000
Processed variant 49000
Processed variant 50000
Processed variant 51000
Processed variant: 51595
All variants processed
Code
v
***** Object of Class vcfR *****
138 samples
1027 CHROMs
51,595 variants
Object size: 532.6 Mb
2.987 percent missing data
***** ***** *****
#read in sampling dataframesamps<-read.csv("~/Desktop/grosbeak.data.csv")samps<-samps[samps$passed.genomic.filtering =="TRUE",]#make popmappm<-data.frame(id=samps$sample_id,pop=c(rep("P1", times=9),rep("P0",times=8),rep("hyb",times=121)))# Create a new vcfR object composed only of sites above the given allele frequency difference thresholdfixed.vcf <-alleleFreqDiff(vcfR = v, pm = pm, p1 ="P0", p2 ="P1", difference =1)
[1] "266 sites passed allele frequency difference threshold"
Code
# Calculate hybrid index and heterozygosity for each sample. Values are returned in a data.framehi.het.test <-hybridIndex(vcfR = fixed.vcf, pm = pm, p1 ="P0", p2 ="P1")
[1] "calculating hybrid indices and heterozygosities based on 266 sites"
Warning in geom_segment(aes(x = 0.5, xend = 1, y = 1, yend = 0), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning in geom_segment(aes(x = 0, xend = 0.5, y = 0, yend = 1), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning in geom_segment(aes(x = 0, xend = 1, y = 0, yend = 0), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Code
#see where the fixed differences are coming from:table(fixed.vcf@fix[,1])
#overwhelmingly, the fixed differences are coming from two scaffolds: "VZSJ01000457.1" (24) and "VZSJ01000270.1" (11). Let's look at the dynamics of those scaffolds:
0.2 subset the data
Code
### plot without these two over-represented scaffolds includedauto.fixed.vcf<-fixed.vcf[fixed.vcf@fix[,1] !="VZSJ01000457.1"& fixed.vcf@fix[,1] !="VZSJ01000270.1",]auto.fixed.vcf
***** Object of Class vcfR *****
138 samples
128 CHROMs
231 variants
Object size: 7.2 Mb
3.89 percent missing data
***** ***** *****
Code
# Calculate hybrid index and heterozygosity for each sample. Values are returned in a data.framehi.het.test <-hybridIndex(vcfR = auto.fixed.vcf, pm = pm, p1 ="P0", p2 ="P1")
[1] "calculating hybrid indices and heterozygosities based on 231 sites"
Warning in geom_segment(aes(x = 0.5, xend = 1, y = 1, yend = 0), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning in geom_segment(aes(x = 0, xend = 0.5, y = 0, yend = 1), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning in geom_segment(aes(x = 0, xend = 1, y = 0, yend = 0), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Code
### plot just those scaffoldsz.fixed.vcf<-fixed.vcf[fixed.vcf@fix[,1] =="VZSJ01000457.1"| fixed.vcf@fix[,1] =="VZSJ01000270.1",]z.fixed.vcf
***** Object of Class vcfR *****
138 samples
2 CHROMs
35 variants
Object size: 4.9 Mb
2.795 percent missing data
***** ***** *****
Code
# Calculate hybrid index and heterozygosity for each sample. Values are returned in a data.framehi.het.test <-hybridIndex(vcfR = z.fixed.vcf, pm = pm, p1 ="P0", p2 ="P1")
[1] "calculating hybrid indices and heterozygosities based on 35 sites"
Warning in geom_segment(aes(x = 0.5, xend = 1, y = 1, yend = 0), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning in geom_segment(aes(x = 0, xend = 0.5, y = 0, yend = 1), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning in geom_segment(aes(x = 0, xend = 1, y = 0, yend = 0), color = "black"): All aesthetics have length 1, but the data has 138 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
#create sample sheet with Z chrom inversion state includedzinv<-data.frame(id=colnames(test.only@gt)[-1],inv.state=c("RB","BH",rep("RB",4),"BH","het","BH",rep("het",3),rep("RB",13),"het",rep("RB",28),"backcrossed RB","RB","het","RB","RB",rep("BH",17),"het",rep("BH",9),"het",rep("BH",46),"RB",rep("BH",4)))
The fact that all samples seem to be either completely homozygous or completely heterozygous for fixed differences on these scaffolds suggests that this is possibly an inversion, where the whole region is inherited as a haplotype with no recombination. These two plots are highly similar, suggesting that these two scaffolds are potentially linked. To assess this, I am going BLAST them to the chromosome assembled Zebra Finch reference genome.
0.3 blast results
Code
#Blasting > 100k base-pairs of each of these two interesting scaffolds against the Zebra Finch reference genome revealed highly significant hits for both scaffolds on the Z chromosome:knitr::include_graphics(c("/Users/devonderaad/Desktop/blast1.png"))
#the scaffolds span at least 4 Mb, from 59 Mb to 63 Mb on the Zebra finch Z scaffold, suggesting a large inversion present in the center of the Z chromosome for these two Pheucticus taxa
0.4 make trees from the putative inverted region
Code
#Isolate Zz.only<-v[v@fix[,1] =="VZSJ01000457.1"| v@fix[,1] =="VZSJ01000270.1",]#get info for this datasetz.only
***** Object of Class vcfR *****
138 samples
2 CHROMs
554 variants
Object size: 10.2 Mb
2.814 percent missing data
***** ***** *****
Code
#z.only<-min_mac(z.only,2)#convert to genlightgen<-vcfR2genlight(z.only)#fix sample names to fit in <= 10 charactersgen@ind.names
pop(gen)<-gen@ind.names#assign populations (a StaMPP requirement)gen@pop<-as.factor(gen@ind.names)#generate pairwise divergence matrixsample.div <-stamppNeisD(gen, pop =FALSE)#export for splitstree#stamppPhylip(distance.mat=sample.div, file="~/Desktop/grosbeak.rad/z.only.splits.txt")library(poppr)
This is poppr version 2.9.6. To get started, type package?poppr
OMP parallel support: unavailable
Code
ploidy(gen)<-2genetic_distance_matrix <- poppr::bitwise.dist(gen, mat =TRUE)#plot the two approaches to calculating genetic divergenceplot(nj(genetic_distance_matrix), type="unrooted")
Code
plot(nj(sample.div), type="unrooted")
Code
#both trees show the expected pattern (three discrete clusters)
0.5 add genomic ancestry info to dataframe
Code
#read in admixture results#setwd to the admixture directory you brought in from the clustersetwd("~/Desktop/grosbeak.rad/admixture.mac")#read in input filesampling<-read.table("binary_fileset.fam")[,1]#get list of input samples in order they appearsampling
#read in all ten runs and save each dataframe in a listruns<-list()#read in log filesfor (i in1:10){ runs[[i]]<-read.table(paste0("binary_fileset.", i, ".Q"))}#isolate run 2 (best according to CV)run2<-runs[[2]]#add sample info in the correct order (same as input vcf)run2$sample<-colnames(v@gt)[-1]#reordersamps$sample_id == run2$sample #check if sample info table order matches the vcf
#add q-values to the sampling data.frame now that orders matchsamps$mac.lud.q<-run2$V1samps$mac.mel.q<-run2$V2#check to see that order matches with the z-inversion dataframe I made abovesamps$sample_id == zinv$id #check if this worked
#check out the one weird sample with 50/50 ancestry, 50/50 phenotype, RB Z inversion and BH mtDNAsamps[samps$zinv.hap =="RB"& samps$mtDNA ==1,]
sample_id institution catalognumber field.number sex ossification
42 P_hybrid_44696 DMNS 44696 GMS-2277 male 100
NA <NA> <NA> NA <NA> <NA> <NA>
NA.1 <NA> <NA> NA <NA> <NA> <NA>
NA.2 <NA> <NA> NA <NA> <NA> <NA>
NA.3 <NA> <NA> NA <NA> <NA> <NA>
NA.4 <NA> <NA> NA <NA> <NA> <NA>
NA.5 <NA> <NA> NA <NA> <NA> <NA>
NA.6 <NA> <NA> NA <NA> <NA> <NA>
NA.7 <NA> <NA> NA <NA> <NA> <NA>
NA.8 <NA> <NA> NA <NA> <NA> <NA>
notes verbatimeventdate stateprovince verbatimlocality
42 39230 South Dakota SD; Lyman County; Oacoma 4mi NE
NA <NA> NA <NA> <NA>
NA.1 <NA> NA <NA> <NA>
NA.2 <NA> NA <NA> <NA>
NA.3 <NA> NA <NA> <NA>
NA.4 <NA> NA <NA> <NA>
NA.5 <NA> NA <NA> <NA>
NA.6 <NA> NA <NA> <NA>
NA.7 <NA> NA <NA> <NA>
NA.8 <NA> NA <NA> <NA>
decimallatitude decimallongitude
42 43.7264 -99.47388
NA NA NA
NA.1 NA NA
NA.2 NA NA
NA.3 NA NA
NA.4 NA NA
NA.5 NA NA
NA.6 NA NA
NA.7 NA NA
NA.8 NA NA
scientificname site recordnumber
42 Pheucticus melanocephalus x Pheucticus ludovicianus 7 GMS2277
NA <NA> NA <NA>
NA.1 <NA> NA <NA>
NA.2 <NA> NA <NA>
NA.3 <NA> NA <NA>
NA.4 <NA> NA <NA>
NA.5 <NA> NA <NA>
NA.6 <NA> NA <NA>
NA.7 <NA> NA <NA>
NA.8 <NA> NA <NA>
day.year weight Ltestislength Ltestiswidth Rtestislength Rtestiswidth
42 148 44.3 6 5 9 6
NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA
NA.5 NA NA NA NA NA NA
NA.6 NA NA NA NA NA NA
NA.7 NA NA NA NA NA NA
NA.8 NA NA NA NA NA NA
total.testis.area mtDNA bill.width bill.length tarsus wing nape back rump
42 84 1 10.48 13.25 19.97 99 1 0 1
NA NA NA NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA NA NA NA
NA.5 NA NA NA NA NA NA NA NA NA
NA.6 NA NA NA NA NA NA NA NA NA
NA.7 NA NA NA NA NA NA NA NA NA
NA.8 NA NA NA NA NA NA NA NA NA
flanks breast.underwing.coverts male.total throat.breast.side.of.neck
42 2 2 6 NA
NA NA NA NA NA
NA.1 NA NA NA NA
NA.2 NA NA NA NA
NA.3 NA NA NA NA
NA.4 NA NA NA NA
NA.5 NA NA NA NA
NA.6 NA NA NA NA
NA.7 NA NA NA NA
NA.8 NA NA NA NA
yellow streaking female.total distance passed.genomic.filtering mac.lud.q
42 NA NA NA 378 TRUE 0.499079
NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA
NA.5 NA NA NA NA NA NA
NA.6 NA NA NA NA NA NA
NA.7 NA NA NA NA NA NA
NA.8 NA NA NA NA NA NA
mac.mel.q zinv.hap
42 0.500921 RB
NA NA <NA>
NA.1 NA <NA>
NA.2 NA <NA>
NA.3 NA <NA>
NA.4 NA <NA>
NA.5 NA <NA>
NA.6 NA <NA>
NA.7 NA <NA>
NA.8 NA <NA>
Code
#check out the samples with het z inversionssamps[samps$zinv.hap =="het",]
sample_id institution catalognumber field.number sex
107 P_hybrid_44771 DMNS 44771 GMS-2356 male
93 P_hybrid_45171 DMNS 45171 GMS-2341B male
94 P_hybrid_45173 DMNS 45173 GMS-2342B male
96 P_hybrid_45174 DMNS 45174 GMS-2344_1 male
55 P_ludovicianus_44711 DMNS 44711 GMS-2292 male
92 P_ludovicianus_44754 DMNS 44754 GMS-2335 male
33 P_melanocephalus_44685 DMNS 44685 GMS-2266 male
44 P_melanocephalus_44699 DMNS 44699 GMS-2280 male
ossification notes verbatimeventdate stateprovince
107 100 39238 South Dakota
93 100 39237 South Dakota
94 100 39237 South Dakota
96 100 39237 South Dakota
55 100 39234 South Dakota
92 100 39236 South Dakota
33 100 39229 South Dakota
44 100 39231 South Dakota
verbatimlocality decimallatitude
107 SD; Lyman County; Oacoma 6mi N 43.70914
93 SD; Charles Mix County; Pickstown 20mi ESE 43.16586
94 SD; Gregory County; Platte 20mi E 43.37610
96 SD; Gregory County; Platte 20mi E 43.37347
55 SD; Charles Mix County; Greenwood 1.5mi ESE 42.96253
92 SD; Charles Mix County; Pickstown 20mi ESE 43.16900
33 SD; Mellette County; Belvidere 4mi NW 43.79035
44 SD; Lyman County; Oacoma 9mi N 43.65264
decimallongitude scientificname site
107 -99.43407 Pheucticus melanocephalus x Pheucticus ludovicianus 7
93 -98.81457 Pheucticus melanocephalus x Pheucticus ludovicianus 9
94 -99.16713 Pheucticus melanocephalus x Pheucticus ludovicianus 8
96 -99.16558 Pheucticus melanocephalus x Pheucticus ludovicianus 8
55 -98.46502 Pheucticus ludovicianus 10
92 -98.80748 Pheucticus ludovicianus 9
33 -101.21631 Pheucticus melanocephalus 3
44 -99.46862 Pheucticus melanocephalus 7
recordnumber day.year weight Ltestislength Ltestiswidth Rtestislength
107 GMS2356 156 49.4 11 8 10
93 GMS2341B 155 44.8 12 10 10
94 GMS2342B 155 51.5 14 9 11
96 GMS2344_1 155 45.2 12 8 8
55 GMS2292 152 40.3 10 8 9
92 GMS2335 154 48.9 11 7 8
33 GMS2266 147 48.6 11 10 10
44 GMS2280 149 47.4 14 8 12
Rtestiswidth total.testis.area mtDNA bill.width bill.length tarsus wing
107 8 168 0 11.56 13.49 22.36 104
93 9 210 0 10.33 12.90 19.96 101
94 9 225 1 10.48 13.66 20.25 102
96 8 160 1 10.83 12.32 21.08 99
55 7 143 0 9.30 12.07 19.35 96
92 7 133 0 10.13 12.88 21.49 96
33 9 200 1 10.59 13.62 19.62 108
44 9 220 0 10.31 13.55 19.33 104
nape back rump flanks breast.underwing.coverts male.total
107 1 1 2 1 1 6
93 1 1 1 1 1 5
94 2 2 2 2 3 11
96 0 0 1 1 4 6
55 0 0 0 0 0 0
92 1 1 1 0 0 3
33 2 2 2 2 4 12
44 2 2 2 1 3 10
throat.breast.side.of.neck yellow streaking female.total distance
107 NA NA NA NA 378
93 NA NA NA NA 460
94 NA NA NA NA 425
96 NA NA NA NA 425
55 NA NA NA NA 500
92 NA NA NA NA 460
33 NA NA NA NA 230
44 NA NA NA NA 378
passed.genomic.filtering mac.lud.q mac.mel.q zinv.hap
107 TRUE 0.491159 0.508841 het
93 TRUE 0.718746 0.281254 het
94 TRUE 0.203175 0.796825 het
96 TRUE 0.127704 0.872296 het
55 TRUE 0.892315 0.107685 het
92 TRUE 0.928835 0.071165 het
33 TRUE 0.182445 0.817555 het
44 TRUE 0.372828 0.627172 het
Code
#check out the female samplessamps[samps$sex =="female",]
sample_id institution catalognumber field.number sex
2 P_ludovicianus_21721 KU 114317 female
3 P_ludovicianus_25286 KU 117253 female
7 P_ludovicianus_34779 KU 134038 female
8 P_ludovicianus_34782 KU 134037 female
58 P_ludovicianus_44714 DMNS 44714 GMS-2295 female
59 P_ludovicianus_44715 DMNS 44715 GMS-2296 female
71 P_ludovicianus_44730 DMNS 44730 GMS-2311 female
89 P_ludovicianus_44749 DMNS 44749 GMS-2330 female
10 P_melanocephalus_34890 DMNS 34890 female
28 P_melanocephalus_44679 DMNS 44679 GMS-2260 female
112 P_melanocephalus_44776 DMNS 44776 GMS-2361 female
156 P_melanocephalus_45232 DMNS 45232 GMS-2810 female
ossification notes verbatimeventdate stateprovince
2 100 no phenotype info 39952 Kansas
3 100 no phenotype info 39944 Kansas
7 100 no phenotype info 43233 Kansas
8 100 no phenotype info 43222 Kansas
58 100 39234 South Dakota
59 100 39234 South Dakota
71 100 39235 South Dakota
89 100 39236 South Dakota
10 0 no phenotype info 41488 Colorado
28 100 39221 South Dakota
112 100 39239 South Dakota
156 100 no phenotype info 41051 South Dakota
verbatimlocality decimallatitude
2 Kansas City; 5009 Rockhill Road 39.03528
3 Kansas City; 5009 Rockhill Road 39.03528
7 Johnson County Community College 38.92400
8 Johnson County Community College 38.92400
58 SD; Charles Mix County; Greenwood 1.5mi ESE 42.95152
59 SD; Charles Mix County; Greenwood 1.5mi ESE 42.93722
71 SD; Clay County; Vermillion 5mi N 42.74242
89 SD; Lincoln County; Newton Hills; Canton 3mi N 43.22350
10 9327 Ute Drive; Golden Co 80403 39.86550
28 SD; Lawrence County; Higgins Gulch; Spearfish 1.5mi NE 44.48330
112 SD; Lyman County; Kennebec 12mi N 43.74631
156 SD; Lawrence County; Higgins Gulch; Spearfish 1.5mi NE 44.48330
decimallongitude scientificname site recordnumber day.year
2 -94.57444 Pheucticus ludovicianus 12 NA
3 -94.57444 Pheucticus ludovicianus 12 NA
7 -94.73500 Pheucticus ludovicianus 12 NA
8 -94.73500 Pheucticus ludovicianus 12 NA
58 -98.45596 Pheucticus ludovicianus 10 GMS2295 152
59 -98.43894 Pheucticus ludovicianus 10 GMS2296 152
71 -96.95320 Pheucticus ludovicianus 11 GMS2311 153
89 -96.56357 Pheucticus ludovicianus 11 GMS2330 154
10 -105.28131 Pheucticus melanocephalus 0 NA
28 -103.93330 Pheucticus melanocephalus 1 GMS2260 139
112 -99.75561 Pheucticus melanocephalus 6 GMS2361 157
156 -103.93330 1 NA
weight Ltestislength Ltestiswidth Rtestislength Rtestiswidth
2 NA NA NA NA NA
3 NA NA NA NA NA
7 NA NA NA NA NA
8 NA NA NA NA NA
58 42.5 NA NA NA NA
59 47.8 NA NA NA NA
71 50.0 NA NA NA NA
89 49.8 NA NA NA NA
10 NA NA NA NA NA
28 48.2 NA NA NA NA
112 51.5 NA NA NA NA
156 NA NA NA NA NA
total.testis.area mtDNA bill.width bill.length tarsus wing nape back rump
2 NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA
58 0 0 10.22 11.19 21.06 101 NA NA NA
59 0 0 9.84 12.55 20.99 97 NA NA NA
71 0 0 10.51 13.06 19.92 96 NA NA NA
89 0 0 10.42 13.90 21.64 96 NA NA NA
10 NA NA NA NA NA NA NA NA NA
28 0 1 9.69 13.05 22.20 106 NA NA NA
112 0 1 10.54 14.38 21.62 103 NA NA NA
156 NA NA NA NA NA NA NA NA NA
flanks breast.underwing.coverts male.total throat.breast.side.of.neck
2 NA NA NA NA
3 NA NA NA NA
7 NA NA NA NA
8 NA NA NA NA
58 NA NA NA 0
59 NA NA NA 1
71 NA NA NA 1
89 NA NA NA 0
10 NA NA NA NA
28 NA NA NA 3
112 NA NA NA 3
156 NA NA NA NA
yellow streaking female.total distance passed.genomic.filtering mac.lud.q
2 NA NA NA NA TRUE 0.999990
3 NA NA NA NA TRUE 0.999990
7 NA NA NA NA TRUE 0.999990
8 NA NA NA NA TRUE 0.999990
58 0 0 0 500 TRUE 0.999990
59 0 1 2 500 TRUE 0.978407
71 0 0 1 623 TRUE 0.999990
89 1 1 2 636 TRUE 0.983500
10 NA NA NA NA TRUE 0.000010
28 2 3 8 0 TRUE 0.000010
112 1 3 7 354 TRUE 0.003458
156 NA NA NA NA TRUE 0.000010
mac.mel.q zinv.hap
2 0.000010 RB
3 0.000010 RB
7 0.000010 RB
8 0.000010 RB
58 0.000010 RB
59 0.021593 RB
71 0.000010 RB
89 0.016500 backcrossed RB
10 0.999990 BH
28 0.999990 BH
112 0.996542 BH
156 0.999990 BH
Code
#is male phenotype associated with z inversion haplotype?hist(samps$male.total[samps$zinv.hap =="RB"])
Code
hist(samps$male.total[samps$zinv.hap =="BH"])
Code
#is male phenotype associated with mtDNA haplotypehist(samps$male.total[samps$mtDNA ==0])
Code
hist(samps$male.total[samps$mtDNA ==1])
Code
#is phenotype associated with genomic ancestryplot(samps$mac.lud.q[samps$sex =="male"], samps$male.total[samps$sex =="male"])
Both mitochondrial ancestry and Z-chromosome inversion haplotype are tightly associated with genomic ancestry. Interestingly, being heterozygous for the Z-chromosome inversion appears to be OK, and the samples that are het are spread across the transect and across the ancestry distribution. But you never see a sample with mismatched major parent ancestry and homozygous Z inversion (excluding the one 49.9/50.1 sample). Similarly, you only see one really mismatched major parent mitochondrial combination.
0.7 plot the divergence landscape
Code
#check how genotypes are encodedtable(extract.gt(v))
0/0 0/1 1/1
6444858 359942 102641
Code
#extract genotype matrixgtmat<-as.data.frame(extract.gt(v))#recode matrixgtmat[gtmat =="0/0"]<-0gtmat[gtmat =="0/1"]<-1gtmat[gtmat =="1/1"]<-2#make sure order matches between sample sheet and genotype matrixcolnames(gtmat) == samps$sample_id
#open up an empty vector to hold FST resultsFST<-c()#Use a for loop to calculate FST for each SNP in the genotype matrixfor(i in1:nrow(gtmat)){#calc HT#q-bar = global derived allele frequency qbar<-sum(as.numeric(gtmat[i,samps$site ==0| samps$site ==12]), na.rm = T)/(2*sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12])))#insert if statement to catch if the SNP has no variation present among the pops of interestif(qbar ==0| qbar ==1){FST[i]<-"no variation"}else{#p-bar = global reference allele frequency pbar<-((2*sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12]))) -sum(as.numeric(gtmat[i,samps$site ==0| samps$site ==12]), na.rm = T)) / (2*sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12])))#HT = 2 * p-bar * q-bar HT<-2*pbar*qbar#calc HS#calculate subpopulation allele frequencies #p0 = subpopulation 0 alternate allele frequency p0<-sum(as.numeric(gtmat[i,samps$site ==0]), na.rm = T)/(2*sum(!is.na(gtmat[i,samps$site ==0])))#p12 = subpopulation 12 alternate allele frequency p12<-sum(as.numeric(gtmat[i,samps$site ==12]), na.rm = T)/(2*sum(!is.na(gtmat[i,samps$site ==12])))#q0 = subpopulation 0 reference allele frequency q0<-1-p0#q12 = subpopulation 12 reference allele frequency q12<-1-p12#calculate expected heterozygosities#Hexp1 = 1 - sum(p1^2 + q1^2) Hexp0<-1-sum(p0^2+ q0^2)#Hexp2 = 1 - sum(p2^2 + q2^2) Hexp12<-1-sum(p12^2+ q12^2)#HS = (Hexp1*N1 + Hexp2*N2)/Ntotal HS<-((Hexp0*sum(!is.na(gtmat[i,samps$site ==0]))) + (Hexp12*sum(!is.na(gtmat[i,samps$site ==12])))) / (sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12])))#calc and store overall FST for the given SNP#FST = (HT-HS)/HT FST[i]<-(HT-HS)/HT }}#quick double check that results make sensetable(FST =="no variation") #less than half of SNPs are actually variable (i.e., MAC > 0)
#FST is clearly elevated on those chroms #it seems like we don't really have the resolution to get a detailed look at the divergence landscape, which would probably require binning FST in sliding windows and better scaffolding. Best left for a future whole genome follow-up to assess how important the inversion is in separating hybrids and whether it is segregating within each pop at all.
Source Code
---title: "grosbeak investigate Z chromosome"format: html: code-fold: show code-tools: truetoc: truetoc-title: Document Contentsnumber-sections: trueembed-resources: true---### Investigate genotype patterns in hybrids```{r}library(triangulaR)library(vcfR)library(ggplot2)library(reshape2)library(adegenet)library(StAMPP)#read in datav<-read.vcfR("~/Desktop/grosbeak.rad/grosbeak.filtered.snps.vcf.gz")vcolnames(v@gt)#read in sampling dataframesamps<-read.csv("~/Desktop/grosbeak.data.csv")samps<-samps[samps$passed.genomic.filtering =="TRUE",]#make popmappm<-data.frame(id=samps$sample_id,pop=c(rep("P1", times=9),rep("P0",times=8),rep("hyb",times=121)))# Create a new vcfR object composed only of sites above the given allele frequency difference thresholdfixed.vcf <-alleleFreqDiff(vcfR = v, pm = pm, p1 ="P0", p2 ="P1", difference =1)# Calculate hybrid index and heterozygosity for each sample. Values are returned in a data.framehi.het.test <-hybridIndex(vcfR = fixed.vcf, pm = pm, p1 ="P0", p2 ="P1")# View triangle plottriangulaR::triangle.plot(hi.het.test)#see where the fixed differences are coming from:table(fixed.vcf@fix[,1])#overwhelmingly, the fixed differences are coming from two scaffolds: "VZSJ01000457.1" (24) and "VZSJ01000270.1" (11). Let's look at the dynamics of those scaffolds:```### subset the data```{r}### plot without these two over-represented scaffolds includedauto.fixed.vcf<-fixed.vcf[fixed.vcf@fix[,1] !="VZSJ01000457.1"& fixed.vcf@fix[,1] !="VZSJ01000270.1",]auto.fixed.vcf# Calculate hybrid index and heterozygosity for each sample. Values are returned in a data.framehi.het.test <-hybridIndex(vcfR = auto.fixed.vcf, pm = pm, p1 ="P0", p2 ="P1")# View triangle plottriangulaR::triangle.plot(hi.het.test)### plot just those scaffoldsz.fixed.vcf<-fixed.vcf[fixed.vcf@fix[,1] =="VZSJ01000457.1"| fixed.vcf@fix[,1] =="VZSJ01000270.1",]z.fixed.vcf# Calculate hybrid index and heterozygosity for each sample. Values are returned in a data.framehi.het.test <-hybridIndex(vcfR = z.fixed.vcf, pm = pm, p1 ="P0", p2 ="P1")# View triangle plottriangulaR::triangle.plot(hi.het.test)hist(hi.het.test$heterozygosity)#make genotype plot for first scaffoldtest.only<-fixed.vcf[fixed.vcf@fix[,1] =="VZSJ01000457.1",]#isolate gt matrixgt<-as.data.frame(t(extract.gt(test.only)))gt$sample<-rownames(gt)dat3 <-melt(gt, id.var ='sample')#plotggplot(dat3, aes(variable, sample)) +geom_tile(aes(fill = value), colour ="white") +scale_fill_manual(values=c("lightpink", "red", "black")) +theme(axis.text.y =element_text(size =5))#repeattest.only<-fixed.vcf[fixed.vcf@fix[,1] =="VZSJ01000270.1",]#isolate gt matrixgt<-as.data.frame(t(extract.gt(test.only)))gt$sample<-rownames(gt)dat3 <-melt(gt, id.var ='sample')#plotggplot(dat3, aes(variable, sample)) +geom_tile(aes(fill = value), colour ="white") +scale_fill_manual(values=c("lightpink", "red", "black")) +theme(axis.text.y =element_text(size =5))#create sample sheet with Z chrom inversion state includedzinv<-data.frame(id=colnames(test.only@gt)[-1],inv.state=c("RB","BH",rep("RB",4),"BH","het","BH",rep("het",3),rep("RB",13),"het",rep("RB",28),"backcrossed RB","RB","het","RB","RB",rep("BH",17),"het",rep("BH",9),"het",rep("BH",46),"RB",rep("BH",4)))```The fact that all samples seem to be either completely homozygous or completely heterozygous for fixed differences on these scaffolds suggests that this is possibly an inversion, where the whole region is inherited as a haplotype with no recombination.These two plots are highly similar, suggesting that these two scaffolds are potentially linked. To assess this, I am going BLAST them to the chromosome assembled Zebra Finch reference genome.### blast results```{r}#Blasting > 100k base-pairs of each of these two interesting scaffolds against the Zebra Finch reference genome revealed highly significant hits for both scaffolds on the Z chromosome:knitr::include_graphics(c("/Users/devonderaad/Desktop/blast1.png"))knitr::include_graphics(c("/Users/devonderaad/Desktop/blast2.png"))#the scaffolds span at least 4 Mb, from 59 Mb to 63 Mb on the Zebra finch Z scaffold, suggesting a large inversion present in the center of the Z chromosome for these two Pheucticus taxa```### make trees from the putative inverted region```{r}#Isolate Zz.only<-v[v@fix[,1] =="VZSJ01000457.1"| v@fix[,1] =="VZSJ01000270.1",]#get info for this datasetz.only#z.only<-min_mac(z.only,2)#convert to genlightgen<-vcfR2genlight(z.only)#fix sample names to fit in <= 10 charactersgen@ind.namesgen@ind.names<-gsub("P_hybrid_","hyb", gen@ind.names)gen@ind.names<-gsub("P_ludovicianus_","lud", gen@ind.names)gen@ind.names<-gsub("P_melanocephalus_","mel", gen@ind.names)gen@ind.namespop(gen)<-gen@ind.names#assign populations (a StaMPP requirement)gen@pop<-as.factor(gen@ind.names)#generate pairwise divergence matrixsample.div <-stamppNeisD(gen, pop =FALSE)#export for splitstree#stamppPhylip(distance.mat=sample.div, file="~/Desktop/grosbeak.rad/z.only.splits.txt")library(poppr)ploidy(gen)<-2genetic_distance_matrix <- poppr::bitwise.dist(gen, mat =TRUE)#plot the two approaches to calculating genetic divergenceplot(nj(genetic_distance_matrix), type="unrooted")plot(nj(sample.div), type="unrooted")#both trees show the expected pattern (three discrete clusters)```### add genomic ancestry info to dataframe```{r}#read in admixture results#setwd to the admixture directory you brought in from the clustersetwd("~/Desktop/grosbeak.rad/admixture.mac")#read in input filesampling<-read.table("binary_fileset.fam")[,1]#get list of input samples in order they appearsampling#read in all ten runs and save each dataframe in a listruns<-list()#read in log filesfor (i in1:10){ runs[[i]]<-read.table(paste0("binary_fileset.", i, ".Q"))}#isolate run 2 (best according to CV)run2<-runs[[2]]#add sample info in the correct order (same as input vcf)run2$sample<-colnames(v@gt)[-1]#reordersamps$sample_id == run2$sample #check if sample info table order matches the vcfsamps<-samps[match(run2$sample,samps$sample_id),] #use 'match' to match orderssamps$sample_id == run2$sample #check if this worked#add q-values to the sampling data.frame now that orders matchsamps$mac.lud.q<-run2$V1samps$mac.mel.q<-run2$V2#check to see that order matches with the z-inversion dataframe I made abovesamps$sample_id == zinv$id #check if this worked#if all above are true, add in z-inversion haplotype to the dataframesamps$zinv.hap<-zinv$inv.state```### assess inversion dynamics```{r}#how does genomic ancestry correspond to inversion haplotype in femaleshist(samps$mac.mel.q[samps$zinv.hap =="BH"& samps$sex =="female"], breaks=50)hist(samps$mac.mel.q[samps$zinv.hap =="RB"& samps$sex =="female"], breaks=50)#what about in males?hist(samps$mac.mel.q[samps$zinv.hap =="BH"& samps$sex =="male"], breaks=50)hist(samps$mac.mel.q[samps$zinv.hap =="RB"& samps$sex =="male"], breaks=50)hist(samps$mac.mel.q[samps$zinv.hap =="het"], breaks=50)#how does mitochondrial ancestry correspond to inversion haplotype in femalessamps$mtDNA[samps$zinv.hap =="BH"& samps$sex =="female"]samps$mtDNA[samps$zinv.hap =="RB"& samps$sex =="female"]#how does mitochondrial ancestry correspond to inversion haplotype in femalessamps$mtDNA[samps$zinv.hap =="BH"& samps$sex =="male"]samps$mtDNA[samps$zinv.hap =="RB"& samps$sex =="male"]samps$mtDNA[samps$zinv.hap =="het"& samps$sex =="male"]#how does genomic ancestry correspond to mitochondrial haplotype in femaleshist(samps$mac.mel.q[samps$mtDNA ==0& samps$sex =="female"], breaks=50)hist(samps$mac.mel.q[samps$mtDNA ==1& samps$sex =="female"], breaks=50)#how does genomic ancestry correspond to mitochondrial haplotype in maleshist(samps$mac.mel.q[samps$mtDNA ==0& samps$sex =="male"], breaks=50)hist(samps$mac.mel.q[samps$mtDNA ==1& samps$sex =="male"], breaks=50)#check out the one weird sample with 50/50 ancestry, 50/50 phenotype, RB Z inversion and BH mtDNAsamps[samps$zinv.hap =="RB"& samps$mtDNA ==1,]#check out the samples with het z inversionssamps[samps$zinv.hap =="het",]#check out the female samplessamps[samps$sex =="female",]#is male phenotype associated with z inversion haplotype?hist(samps$male.total[samps$zinv.hap =="RB"])hist(samps$male.total[samps$zinv.hap =="BH"])#is male phenotype associated with mtDNA haplotypehist(samps$male.total[samps$mtDNA ==0])hist(samps$male.total[samps$mtDNA ==1])#is phenotype associated with genomic ancestryplot(samps$mac.lud.q[samps$sex =="male"], samps$male.total[samps$sex =="male"])plot(samps$mac.lud.q[samps$sex =="female"], samps$female.total[samps$sex =="female"])#how frequent is genomic versus phenotypic intermediacy?table(samps$male.total)table(round(samps$mac.lud.q, 2))#what proportion of samples are detectably admixed?table(samps$mac.lud.q >0.02& samps$mac.lud.q <0.98)table(samps$mac.lud.q >0.05& samps$mac.lud.q <0.95)#what sites are those detectably admixed samples fromtable(samps$site[samps$mac.lud.q >0.02& samps$mac.lud.q <0.98])table(samps$site[samps$mac.lud.q >0.05& samps$mac.lud.q <0.95])table(samps$site)```Both mitochondrial ancestry and Z-chromosome inversion haplotype are tightly associated with genomic ancestry. Interestingly, being heterozygous for the Z-chromosome inversion appears to be OK, and the samples that are het are spread across the transect and across the ancestry distribution. But you never see a sample with mismatched major parent ancestry and homozygous Z inversion (excluding the one 49.9/50.1 sample). Similarly, you only see one really mismatched major parent mitochondrial combination.### plot the divergence landscape```{r}#check how genotypes are encodedtable(extract.gt(v))#extract genotype matrixgtmat<-as.data.frame(extract.gt(v))#recode matrixgtmat[gtmat =="0/0"]<-0gtmat[gtmat =="0/1"]<-1gtmat[gtmat =="1/1"]<-2#make sure order matches between sample sheet and genotype matrixcolnames(gtmat) == samps$sample_id#open up an empty vector to hold FST resultsFST<-c()#Use a for loop to calculate FST for each SNP in the genotype matrixfor(i in1:nrow(gtmat)){#calc HT#q-bar = global derived allele frequency qbar<-sum(as.numeric(gtmat[i,samps$site ==0| samps$site ==12]), na.rm = T)/(2*sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12])))#insert if statement to catch if the SNP has no variation present among the pops of interestif(qbar ==0| qbar ==1){FST[i]<-"no variation"}else{#p-bar = global reference allele frequency pbar<-((2*sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12]))) -sum(as.numeric(gtmat[i,samps$site ==0| samps$site ==12]), na.rm = T)) / (2*sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12])))#HT = 2 * p-bar * q-bar HT<-2*pbar*qbar#calc HS#calculate subpopulation allele frequencies #p0 = subpopulation 0 alternate allele frequency p0<-sum(as.numeric(gtmat[i,samps$site ==0]), na.rm = T)/(2*sum(!is.na(gtmat[i,samps$site ==0])))#p12 = subpopulation 12 alternate allele frequency p12<-sum(as.numeric(gtmat[i,samps$site ==12]), na.rm = T)/(2*sum(!is.na(gtmat[i,samps$site ==12])))#q0 = subpopulation 0 reference allele frequency q0<-1-p0#q12 = subpopulation 12 reference allele frequency q12<-1-p12#calculate expected heterozygosities#Hexp1 = 1 - sum(p1^2 + q1^2) Hexp0<-1-sum(p0^2+ q0^2)#Hexp2 = 1 - sum(p2^2 + q2^2) Hexp12<-1-sum(p12^2+ q12^2)#HS = (Hexp1*N1 + Hexp2*N2)/Ntotal HS<-((Hexp0*sum(!is.na(gtmat[i,samps$site ==0]))) + (Hexp12*sum(!is.na(gtmat[i,samps$site ==12])))) / (sum(!is.na(gtmat[i,samps$site ==0| samps$site ==12])))#calc and store overall FST for the given SNP#FST = (HT-HS)/HT FST[i]<-(HT-HS)/HT }}#quick double check that results make sensetable(FST =="no variation") #less than half of SNPs are actually variable (i.e., MAC > 0)table(FST)#make dataframeFST.dataframe<-data.frame(chrom=v@fix[,1],pos=v@fix[,2],FST=FST)#isolate variable SNPsFST.dataframe<-FST.dataframe[FST.dataframe$FST !="no variation",]head(FST.dataframe)#make numericFST.dataframe$FST<-as.numeric(FST.dataframe$FST)head(FST.dataframe)#plot histogram of FSThist(FST.dataframe$FST)#plot divergence landscapeplot(1:nrow(FST.dataframe), FST.dataframe$FST, cex=0.1, pch=19)plot(1:nrow(FST.dataframe), FST.dataframe$FST, cex=0.1, pch=19, col=as.factor(FST.dataframe$chrom))palette(c("grey","black"))plot(1:nrow(FST.dataframe), FST.dataframe$FST, cex=0.1, pch=19, col=as.factor(FST.dataframe$chrom))#explorer1<-with(FST.dataframe, tapply(FST, chrom, mean))hist(r1)r1[r1 > .5]hist(table(FST.dataframe$chrom), breaks=100)table(FST.dataframe$chrom[FST.dataframe$FST ==1])#check whether FST is elevated on the putative inversion chromosomes versus the rest of the genomemean(FST.dataframe$FST)mean(FST.dataframe$FST[FST.dataframe$chrom =="VZSJ01000457.1"])mean(FST.dataframe$FST[FST.dataframe$chrom =="VZSJ01000270.1"])#FST is clearly elevated on those chroms #it seems like we don't really have the resolution to get a detailed look at the divergence landscape, which would probably require binning FST in sliding windows and better scaffolding. Best left for a future whole genome follow-up to assess how important the inversion is in separating hybrids and whether it is segregating within each pop at all.```